AUTOREGRESSIVE SPECTRAL ESTIMATION, LOG
SPECTRAL SMOOTHING, AND ENTROPY


Emanuel  Parzen 
Institute  of  Statistics 
Texas  A&M  University 
College  Station,  TX  77843 


Abstract 

Spectral estimation is motivated by information
divergence distance. Two methods of spectral estimation
are developed in this paper: autoregressive
spectral estimation (section 3) and log spectral
kernel estimation (section 4). They are motivated as
parametric and non-parametric estimators which minimize
"entropy" or "information divergence" distances
between raw and fitted spectral densities. The role
of entropy concepts in the statistical estimation of
spectral densities (section 2) is explained by contrasting
it with the role of entropy concepts in probability
density estimation (section 1). Adaptive
procedures for forming, and combining, these estimators
for an observed time series are provided by
order-determining and truncation (half-power) point
determining criteria, which are described.

1. The role of entropy concepts in statistical
estimation of probability density functions.

Let X be a continuous random variable, and
$X_1, \ldots, X_n$ a random sample of X (consisting of
independent random variables identically distributed as X).
The distribution function F(x), $-\infty < x < \infty$, and the
probability density function f(x), $-\infty < x < \infty$, are
defined by

$$F(x) = \Pr(X \le x), \qquad f(x) = F'(x).$$

The entropy (or Shannon information) of X is
denoted by H(f) and is defined by

$$H(f) = \int_{-\infty}^{\infty} \{-\log f(x)\}\, f(x)\, dx = E_f[-\log f(X)].$$

All observed distributions are assumed to have finite
entropy.

A maximum entropy density is a probability density
f(x) determined by maximizing H(f) over all f
satisfying certain constraints (usually involving
moments of f).

Theorem 1A: Three important densities, and their
characterization by a maximum entropy principle are:
(1) uniform distribution over an interval a to b
maximizes H(f) over the constraint that f is non-zero
only on the interval a to b; (2) exponential distribution
with mean $\mu$ maximizes H(f) over the constraint
that f is non-zero only for x > 0, and has mean $\mu$; (3)
normal distribution with mean $\mu$ and variance $\sigma^2$
maximizes H(f) over the constraint that f has mean $\mu$ and
variance $\sigma^2$.
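As a small numerical illustration of part (3) of Theorem 1A, the sketch below compares the closed-form entropies of a normal, a uniform, and an exponential density that all share the same variance, and checks the normal value against a midpoint-rule evaluation of the entropy integral. The function name `entropy_numeric`, the constant `SIGMA`, and the integration range are illustrative choices, not part of the paper.

```python
import math

def entropy_numeric(pdf, lo, hi, n=200_000):
    """Midpoint-rule approximation of H(f) = integral of -log f(x) * f(x) dx."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * h)
        if p > 0.0:
            total -= p * math.log(p) * h
    return total

SIGMA = 1.0  # common standard deviation shared by the three densities (assumed value)

# Closed-form entropies of three densities with variance SIGMA**2.
h_normal = 0.5 * math.log(2 * math.pi * math.e * SIGMA**2)
h_uniform = math.log(math.sqrt(12.0) * SIGMA)   # uniform on an interval of width sqrt(12)*SIGMA
h_exponential = 1.0 + math.log(SIGMA)           # exponential with mean SIGMA

def normal_pdf(x):
    return math.exp(-0.5 * (x / SIGMA) ** 2) / (SIGMA * math.sqrt(2 * math.pi))

# Numerical check of the normal entropy over a range wide enough to hold the mass.
h_normal_num = entropy_numeric(normal_pdf, -12 * SIGMA, 12 * SIGMA)
```

Among these three, the normal density has the largest entropy, as the theorem requires for any competitor with the same mean and variance.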

The maximum entropy principle is a probability
modeling principle in the foregoing examples. It
becomes a statistical estimation principle (which fits
distributions to data) when the constraints on f are
expressed in terms of sample means and variances; it
is then similar to the method of moments. A maximum
entropy density estimator $\hat f$ can be expressed in
symbols:

$$H(\hat f) = \max_f H(f)$$

where f is constrained to have certain moments equal
to the corresponding sample moments.

An alternative (and, we believe, more general)
statistical estimation principle is provided by the
cross-entropy H(g;f) and information divergence I(g;f)
of two probability density functions f(x) and g(x).
Define

$$H(g;f) = \int_{-\infty}^{\infty} \{-\log g(x)\}\, f(x)\, dx = E_f[-\log g(X)];$$

$$I(g;f) = \int_{-\infty}^{\infty} \Big\{-\log \frac{g(x)}{f(x)}\Big\}\, f(x)\, dx = E_f\Big[-\log \frac{g(X)}{f(X)}\Big].$$

Note that H(f) = H(f;f). Another name for information
divergence is Kullback-Leibler information number. A
minimum information divergence density $\hat g$ is an
approximator to a specified density f determined by

$$I(\hat g; f) = \min_g I(g;f)$$

where g is constrained to belong to a specified
parametric family of probability densities.

This research was supported by the Office of Naval
Research (Contract N00014-81-MP-10001, ARO DAAG29-80-C-0070).

Theorem 1B: Three important examples of minimum
information divergence approximators or estimators
are: (1) f is assumed to be positive only over the
interval a to b, and g is any uniform distribution;
then $\hat g$ is the uniform distribution over a to b; (2)
f is positive only for x > a, and has a finite mean $\mu$,
and g is the two parameter exponential distribution;
then $\hat g$ is the exponential distribution with mean $\mu$ and
domain $(a, \infty)$; (3) f has finite mean $\mu$ and variance
$\sigma^2$, and g is any normal distribution; then $\hat g$ is
$N(\mu, \sigma^2)$.
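Part (3) of this theorem can be checked numerically. The sketch below takes f to be the uniform density on (0,1), which is an assumed example, and searches a grid of normal densities N(m, s2) for the one minimizing the cross-entropy H(g;f). For a normal g the cross-entropy depends on f only through its mean and variance, which is why the minimizer matches those two moments. The helper name and the grids are hypothetical.

```python
import math

def cross_entropy_normal(mean_f, var_f, m, s2):
    """H(g; f) = E_f[-log g(X)] when g is the N(m, s2) density.

    Expanding -log g(x) and taking expectations under f gives
    0.5*log(2*pi*s2) + (var_f + (mean_f - m)**2) / (2*s2).
    """
    return 0.5 * math.log(2 * math.pi * s2) + (var_f + (mean_f - m) ** 2) / (2 * s2)

# Assumed example: f is uniform on (0, 1), so its mean is 1/2 and its variance 1/12.
mean_f, var_f = 0.5, 1.0 / 12.0

grid_m = [i / 100 for i in range(0, 101)]       # candidate means 0.00, 0.01, ..., 1.00
grid_s2 = [j / 1200 for j in range(1, 401)]     # candidate variances, including 100/1200 = 1/12

best_value, best_m, best_s2 = min(
    (cross_entropy_normal(mean_f, var_f, m, s2), m, s2)
    for m in grid_m
    for s2 in grid_s2
)
```

Minimizing H(g;f) over g is the same as minimizing I(g;f), since the two differ by H(f), which does not depend on g.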

Theorem 1B may have been first explicitly formulated
by Thiel (1981), although it is implicitly known
through the equivalence of maximum likelihood estimation
with minimum information divergence estimation.

An important observation by Thiel is that Theorem 1B
can be used to prove Theorem 1A, and thus avoid the use
of the calculus of variations. We extend this observation
to spectral density estimation in section 2.

A minimum information divergence density $\hat g$ can be
expressed as a minimum cross-entropy density:

$$H(\hat g; f) = \min_g H(g;f).$$

A cross-entropy can be defined for an arbitrary
(including discrete) distribution function F(x) by

$$H(g;F) = \int_{-\infty}^{\infty} \{-\log g(x)\}\, dF(x).$$

Therefore a minimum information divergence density $\hat g$
can be defined for an arbitrary distribution function
F by

$$H(\hat g; F) = \min_g H(g;F)$$

where g is constrained to belong to a specified
parametric family of probability densities. Theorem 1B is
true for this definition.

Consider now a finite sample $X_1, \ldots, X_n$ and a
parametric model $f_\theta(x)$ for the true probability density
f(x), indexed by parameters $\theta$ which one would like to
estimate from the sample. A maximum likelihood estimator
$\hat\theta$ of $\theta$ is defined as the parameter values $\theta$
maximizing

$$L_n(\theta) = \frac{1}{n} \log f_\theta(X_1, \ldots, X_n).$$

Let $\tilde F(x)$, $-\infty < x < \infty$, denote the sample distribution
function defined by

$$\tilde F(x) = \text{fraction of } X_1, \ldots, X_n \le x.$$

One can express

$$L_n(\theta) = \int_{-\infty}^{\infty} \log f_\theta(x)\, d\tilde F(x) = -H(f_\theta; \tilde F).$$

Therefore maximum likelihood parameter estimators $\hat\theta$
yield minimal information divergence densities $f_{\hat\theta}(x)$.
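To make the identity between the normalized log likelihood and the negative sample cross-entropy concrete, here is a minimal sketch for an assumed one-parameter exponential model f_theta(x) = theta * exp(-theta * x). The model, the seed, the sample size, and the grid are illustrative choices. The grid minimizer of the sample cross-entropy should agree with the closed-form maximum likelihood estimator 1 / (sample mean) up to the grid spacing.

```python
import math
import random

def sample_cross_entropy(theta, xs):
    """H(f_theta; F~) = -(1/n) * sum of log f_theta(X_i) for f_theta(x) = theta*exp(-theta*x)."""
    return -sum(math.log(theta) - theta * x for x in xs) / len(xs)

random.seed(0)                                   # fixed seed so the sketch is reproducible
xs = [random.expovariate(2.0) for _ in range(500)]

# Closed-form maximum likelihood estimator for the exponential rate.
theta_closed = 1.0 / (sum(xs) / len(xs))

# Grid search for the minimum cross-entropy parameter.
grid = [0.01 * k for k in range(1, 1001)]        # theta = 0.01, 0.02, ..., 10.00
theta_grid = min(grid, key=lambda th: sample_cross_entropy(th, xs))
```

Because the sample cross-entropy is convex in theta for this model, the grid minimizer is one of the two grid points bracketing the closed-form estimator.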

By introducing $\tilde f$ to denote the (symbolic) sample
probability density of the sample, one can regard $\hat\theta$
as satisfying

$$I(f_{\hat\theta}; \tilde f) = \min_\theta I(f_\theta; \tilde f).$$

A re-interpretation of maximum likelihood is obtained
by rewriting the information divergence in terms of
quantile functions whose role in statistical inference
is emphasized by Parzen (1979).

Introduce the sample quantile function

$$\tilde Q(u) = \tilde F^{-1}(u).$$

Its derivative $\tilde q(u) = \tilde Q'(u)$ satisfies

$$\tilde q(u)\, \tilde f(\tilde Q(u)) = 1.$$

Define

$$d_\theta(u) = \frac{f_\theta(\tilde Q(u))}{\tilde f(\tilde Q(u))} = \frac{d}{du} F_\theta(\tilde Q(u))$$

where $F_\theta(x)$ is the distribution function with density
$f_\theta(x)$. Make the change of variable $u = \tilde F(x)$,
$x = \tilde Q(u)$ to obtain

$$I(f_\theta; \tilde f) = \int_0^1 -\log d_\theta(u)\, du$$

which one can interpret as a measure of how close to
a uniform density is $d_\theta(u)$. The full consequences
of this interpretation are explored elsewhere.

It should be noted that one can define other
measures to minimize to form parameter estimators:
examples are

$$\int_0^1 \{d_\theta(u) - 1\}^2\, du,$$

whose minimization leads to "modified chi-square"
estimators, and

$$\int_0^1 \{F_\theta(\tilde Q(u)) - u\}^2\, du,$$

whose minimization leads to "minimum distance
estimators".

Information divergence is the measure that most
readily generalizes to stochastic processes.

It should be noted that only the parameter
estimation problem is efficiently solved by minimizing
$I(f_\theta; \tilde f)$. The problem of goodness of fit is
solved by considering the size of the difference from
the uniform distribution D(u) = u of $D_\theta(u) = F_\theta(\tilde Q(u))$
for $\theta = \hat\theta$. The model identification problem is to
find distribution functions F such that $F(\tilde Q(u))$ is
parsimoniously not significantly different from the
uniform distribution D(u) = u.


2. The role of entropy concepts in statistical
estimation of spectral density functions.

Let Y(t), t = 1, ..., T be a sample of a Gaussian
zero mean stationary time series with covariance
function

$$R(v) = E[Y(t)\, Y(t+v)], \qquad v = 0, \pm 1, \pm 2, \ldots$$

and correlation function

$$\rho(v) = \frac{R(v)}{R(0)} = \mathrm{Corr}[Y(t), Y(t+v)].$$

We assume R(v) and $\rho(v)$ are absolutely summable,
and define the power spectrum S(w), $0 \le w \le 1$, and
the spectral density f(w), $0 \le w \le 1$, by

$$S(w) = \sum_{v=-\infty}^{\infty} e^{-2\pi i w v} R(v),$$

$$f(w) = \sum_{v=-\infty}^{\infty} e^{-2\pi i w v} \rho(v).$$

The spectral distribution function is defined by

$$F(w) = \int_0^w f(w')\, dw', \qquad 0 \le w \le 1.$$

When $\rho(v)$ is not assumed to be summable, there always
exists a spectral distribution function F(w), $0 \le w \le 1$,
such that

$$\rho(v) = \int_0^1 e^{2\pi i w v}\, dF(w).$$

When $\rho(v)$ is summable, it has the spectral
representation

$$\rho(v) = \int_0^1 e^{2\pi i w v} f(w)\, dw.$$
A stationary Gaussian time series is called
white noise if

$$\rho(v) = 0, \quad v > 0;$$
$$f(w) = 1, \quad 0 \le w \le 1;$$
$$F(w) = w, \quad 0 \le w \le 1.$$
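The definition of the spectral density as the Fourier transform of the correlation function can be sketched in a few lines. The example below evaluates a truncated version of the sum for two assumed correlation sequences: white noise, and the geometric sequence rho(v) = PHI**|v| (a first-order autoregression, chosen here only as an illustration), for which the closed form is a standard result. All names are hypothetical.

```python
import math

def spectral_density(rho, w, vmax):
    """Truncated f(w) = sum over |v| <= vmax of exp(-2*pi*i*w*v) * rho(v).

    The sum is real because rho(-v) = rho(v), so it reduces to a cosine series.
    """
    return rho(0) + 2.0 * sum(
        rho(v) * math.cos(2 * math.pi * w * v) for v in range(1, vmax + 1)
    )

PHI = 0.6  # assumed correlation decay parameter

def rho_ar1(v):
    return PHI ** abs(v)

def rho_white(v):
    return 1.0 if v == 0 else 0.0

def f_ar1_closed(w):
    """Closed form of the spectral density when rho(v) = PHI**|v|."""
    return (1 - PHI**2) / (1 - 2 * PHI * math.cos(2 * math.pi * w) + PHI**2)

values = [(w, spectral_density(rho_ar1, w, 200)) for w in (0.0, 0.1, 0.25, 0.5)]
```

For white noise the sum collapses to the single term rho(0) = 1, which is the condition f(w) = 1 stated above.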

A stationary Gaussian time series with summable
correlation function and integrable log spectral density
can be represented in terms of a white noise time
series $\varepsilon(t)$ representing the innovations [prediction
errors $Y^{\nu}(t) = Y(t) - Y^{\mu}(t)$ of the infinite memory
one-step ahead predictor $Y^{\mu}(t)$ of Y(t)]. The AR($\infty$), or
infinite order autoregressive representation, is

$$Y(t) + a_\infty(1) Y(t-1) + \ldots + a_\infty(n) Y(t-n) + \ldots = \varepsilon(t).$$

The MA($\infty$), or infinite order moving average representation, is

$$Y(t) = \varepsilon(t) + b_\infty(1) \varepsilon(t-1) + b_\infty(2) \varepsilon(t-2) + \ldots$$

A finite parameter representation is an ARMA(p,q) of
the form

$$Y(t) + a_p(1) Y(t-1) + \ldots + a_p(p) Y(t-p) = \varepsilon(t) + b_q(1) \varepsilon(t-1) + \ldots + b_q(q) \varepsilon(t-q).$$
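As a minimal sketch of these representations, the code below takes the pure autoregressive case (q = 0) with assumed coefficients a(1) = -1.0 and a(2) = 0.5, generates Y(t) from a simulated innovation series, and then applies the autoregressive filter to Y(t) to recover the innovations. Zero initial conditions, the seed, and the coefficient values are illustrative assumptions.

```python
import random

def ar_generate(e, a):
    """Build y from innovations: y(t) = e(t) - a[0]*y(t-1) - ... - a[p-1]*y(t-p).

    Values of y before the start of the series are taken to be zero.
    """
    y = []
    for t in range(len(e)):
        past = sum(a[k] * y[t - 1 - k] for k in range(len(a)) if t - 1 - k >= 0)
        y.append(e[t] - past)
    return y

def ar_filter(y, a):
    """Apply the filter e(t) = y(t) + a[0]*y(t-1) + ... + a[p-1]*y(t-p)."""
    return [
        y[t] + sum(a[k] * y[t - 1 - k] for k in range(len(a)) if t - 1 - k >= 0)
        for t in range(len(y))
    ]

random.seed(1)
a = [-1.0, 0.5]   # hypothetical AR(2): Y(t) - Y(t-1) + 0.5*Y(t-2) = e(t)
e = [random.gauss(0.0, 1.0) for _ in range(200)]
y = ar_generate(e, a)
e_recovered = ar_filter(y, a)
max_error = max(abs(u - v) for u, v in zip(e, e_recovered))
```

The polynomial 1 - z + 0.5 z^2 has roots 1 + i and 1 - i, which lie outside the unit circle, so this assumed example is stationary.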

The filter relating Y(t) and $\varepsilon(t)$ is called a
whitening filter. Parameter estimation is the theory
of estimation of the parameters of the whitening filter
and model identification is the theory of estimation of
the structural form of the whitening filter. To develop
approaches to parameter estimation for a random
sample, in section 1 we defined the following concepts:

Entropy H(f),

Maximum entropy density $\hat f$,

Cross-entropy H(g;f),

Information divergence I(g;f),

Minimum information divergence density,

Minimum cross-entropy density,

Likelihood of a sample,

Maximum likelihood parameter estimator.

To develop approaches to estimation of the parameters
$\theta$ of a parametric model $f_\theta(w)$ of the spectral
density f(w) of a stationary zero mean Gaussian time
series Y(t), we develop analogues of the foregoing
concepts. We start with an approximate formula for the
likelihood function

$$L_T(\theta) = \frac{1}{T} \log f_\theta(Y(1), \ldots, Y(T))$$

of the time series sample. We assume that Y(t) has
been divided by $\{R(0)\}^{1/2}$ so that it can be considered
to have variance 1, and its covariance function equals
its correlation function.


The first step in analyzing a time series should
be to compute the sample correlation function

$$\hat\rho(v) = \sum_{t=1}^{T-v} Y(t)\, Y(t+v) \div \sum_{t=1}^{T} Y^2(t)$$

and the sample spectral density

$$\tilde f(w) = \Big| \sum_{t=1}^{T} Y(t)\, e^{-2\pi i w t} \Big|^2 \div \sum_{t=1}^{T} Y^2(t) = \sum_{|v| < T} e^{-2\pi i w v}\, \hat\rho(v).$$
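The two expressions for the sample spectral density, one as a normalized squared modulus of the Fourier transform of the data and one as the Fourier transform of the sample correlations, can be checked against each other directly. The sketch below does this for a short made-up series. The data values and function names are illustrative only, and the direct sums cost on the order of T squared operations rather than using a fast transform.

```python
import cmath
import math

def sample_correlations(y):
    """rho_hat(v) = sum_{t=1}^{T-v} Y(t)Y(t+v) / sum_{t=1}^{T} Y(t)^2 for v = 0..T-1."""
    T = len(y)
    denom = sum(val * val for val in y)
    return [sum(y[t] * y[t + v] for t in range(T - v)) / denom for v in range(T)]

def sample_spectral_density(y, w):
    """|sum_t Y(t) exp(-2*pi*i*w*t)|^2 / sum_t Y(t)^2, with t running from 1 to T."""
    denom = sum(val * val for val in y)
    dft = sum(y[t] * cmath.exp(-2j * math.pi * w * (t + 1)) for t in range(len(y)))
    return abs(dft) ** 2 / denom

def spectral_density_from_correlations(rho_hat, w):
    """sum over |v| < T of exp(-2*pi*i*w*v) * rho_hat(v), using rho_hat(-v) = rho_hat(v)."""
    return rho_hat[0] + 2.0 * sum(
        rho_hat[v] * math.cos(2 * math.pi * w * v) for v in range(1, len(rho_hat))
    )

y = [1.0, -0.5, 0.3, 0.8, -1.2, 0.1, 0.9, -0.4]   # made-up data for illustration
rho_hat = sample_correlations(y)
pairs = [
    (sample_spectral_density(y, w), spectral_density_from_correlations(rho_hat, w))
    for w in (0.0, 0.13, 0.25, 0.5, 0.77)
]
```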


It should be emphasized that in practice one
should consider using a "data window" to compute $\tilde f(w)$,
for w = k/Q, k = 0, 1, ..., Q-1, by

$$\tilde f(w) = |\phi(w)|^2 \div \frac{1}{Q} \sum_{k=0}^{Q-1} \Big|\phi\Big(\frac{k}{Q}\Big)\Big|^2,$$

$$\phi(w) = \sum_{t=1}^{T} Y(t)\, K\Big(\frac{t}{T}\Big) \exp(-2\pi i w t)$$

for a suitable kernel K(x) (properties of windows are
discussed in Harris (1978)). In addition for statistical
stability one should then slightly smooth $\tilde f(w)$:
(1) compute the sample correlation function by

$$\hat\rho(v) = \frac{1}{Q} \sum_{k=0}^{Q-1} \exp\Big(2\pi i \frac{k}{Q} v\Big)\, \tilde f\Big(\frac{k}{Q}\Big),$$

which holds for $0 \le v \le Q-T$ (and therefore one may
want to choose $Q \ge 2T$); (2) compute a slightly
smoothed sample spectral density by

$$\tilde f(w) = \sum_{|v| < T} \exp(-2\pi i w v)\, k\Big(\frac{v}{M}\Big)\, \hat\rho(v)$$

where $M \ge T/2$ and k(u) is a suitable kernel, such as
the Parzen lag window:

$$k(u) = 1 - 6u^2 + 6|u|^3, \quad |u| \le 0.5,$$
$$k(u) = 2(1 - |u|)^3, \quad 0.5 \le |u| \le 1,$$
$$k(u) = 0, \quad |u| \ge 1.$$

Back at the likelihood ranch, one may show that
approximately

$$-L_T(\theta) = \frac{1}{2} \log 2\pi + H(f_\theta; \tilde f)$$

where

$$H(f_\theta; \tilde f) = \frac{1}{2} \int_0^1 \Big\{ \log f_\theta(w) + \frac{\tilde f(w)}{f_\theta(w)} \Big\}\, dw.$$

This formula for likelihood shows that the sample
spectral density $\tilde f(w)$ is a sufficient statistic
for a time series. However, it is a very wiggly function
and by itself is not a consistent estimator of
f(w). Estimators $\hat f(w)$ of f(w) can be regarded as
"smoothings" of $\tilde f(w)$, but the basic problem is how
much to smooth.

Another aspect of the likelihood formula is its
justification as an approximation. To those misguided
analysts for whom maximum likelihood provides the
ultimate estimator for which no expense should be
spared, there is no substitute for the exact likelihood
(which of course is exact only if the model being
assumed is exactly true). Information concepts enter
estimation theory when one recognizes that maximum
likelihood estimation is a technical device for
carrying out minimum information divergence estimation.
The information divergence for a sample Y(t),
t = 1, ..., T, is defined in general by

$$I_T(f_\theta; f) = \frac{1}{T}\, E_f\Big[ -\log \frac{f_\theta(Y(1), \ldots, Y(T))}{f(Y(1), \ldots, Y(T))} \Big]$$

where here f denotes the true probability density of
the sample, and $f_\theta$ is a model for f. It should be
noted that we are using the notation f and $f_\theta$ with a
variety of meanings. For a Gaussian zero mean stationary
time series, the probability density of the sample
is specified by the spectral densities, f(w) of the
true distribution and $f_\theta(w)$ of the model. We continue
to denote the information divergence by $I_T(f_\theta; f)$ but
now f indicates a spectral density rather than a
probability density. Pinsker (1963) proves a formula for
$I_T(f_\theta; f)$ in the limit as $T \to \infty$:

$$\lim_{T \to \infty} I_T(f_\theta; f) = I(f_\theta; f)$$

where $I(f_\theta; f)$ is the information divergence defined as
follows.

For two spectral densities f and g, the information
divergence I(g;f), cross-entropy H(g;f), and
entropy H(f) are defined:

$$I(g;f) = \frac{1}{2} \int_0^1 \Big\{ \frac{f(w)}{g(w)} - \log \frac{f(w)}{g(w)} - 1 \Big\}\, dw = H(g;f) - H(f;f),$$

$$H(g;f) = \frac{1}{2} \int_0^1 \Big\{ \log g(w) + \frac{f(w)}{g(w)} \Big\}\, dw,$$

$$H(f) = H(f;f) = \frac{1}{2} \int_0^1 \{ \log f(w) + 1 \}\, dw.$$

Since $u - \log u - 1 \ge 0$ for all u > 0, I has two of the
properties of a distance: $I(g;f) \ge 0$, $I(f;f) = 0$.
However I does not satisfy the triangle inequality.
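The information divergence between two spectral densities is easy to evaluate numerically. The sketch below uses a midpoint rule on [0,1] and two assumed densities: white noise, and the density of a first-order autoregression with parameter 0.5 (both chosen only for illustration). It shows that the divergence vanishes when the two arguments coincide, is positive otherwise, and is not symmetric in its arguments. Function names and the number of grid points are hypothetical.

```python
import math

def info_divergence(g, f, n=2000):
    """I(g; f) = 0.5 * integral over [0,1] of { f/g - log(f/g) - 1 } dw, midpoint rule."""
    total = 0.0
    for k in range(n):
        w = (k + 0.5) / n
        u = f(w) / g(w)
        total += u - math.log(u) - 1.0
    return 0.5 * total / n

def ar1_density(phi):
    """Spectral density with correlations phi**|v| (assumed example family)."""
    return lambda w: (1 - phi**2) / (1 - 2 * phi * math.cos(2 * math.pi * w) + phi**2)

def white(w):
    return 1.0

f = ar1_density(0.5)
i_white_f = info_divergence(white, f)   # model white noise, truth f
i_f_white = info_divergence(f, white)   # model f, truth white noise
i_f_f = info_divergence(f, f)
```

For this example the first value equals -0.5 * log(1 - 0.25), because f integrates to 1 and its log integrates to log(1 - 0.25).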

The information divergence can be related to the
L2 log spectral density distance

$$L2L(f,g) = \int_0^1 \{ \log f(w) - \log g(w) \}^2\, dw,$$

using the fact that $u = \exp(\log u) \approx 1 + \log u + (\tfrac{1}{2})(\log u)^2$.
When f and g are "neighbors" in the sense
that their ratio approximates 1,

$$I(g;f) \approx \tfrac{1}{4}\, L2L(f,g);$$

then minimizing I is equivalent to minimizing L2L.

An extensive discussion of these distances is given by
Gray, Buzo, Gray, and Matsuyama (1980).
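The approximation between the information divergence and the L2 log spectral distance can be written out step by step. The following is a sketch of the second-order expansion, with the shorthand $x(w)$ introduced here only for the derivation:

```latex
\begin{align*}
x(w) &= \log f(w) - \log g(w), \qquad \frac{f(w)}{g(w)} = e^{x(w)}, \\
\frac{f(w)}{g(w)} - \log\frac{f(w)}{g(w)} - 1
  &= e^{x(w)} - x(w) - 1
   = \tfrac{1}{2}x(w)^2 + \tfrac{1}{6}x(w)^3 + \cdots , \\
I(g;f) &= \tfrac{1}{2}\int_0^1 \bigl\{ e^{x(w)} - x(w) - 1 \bigr\}\, dw
  \approx \tfrac{1}{2}\int_0^1 \tfrac{1}{2}\, x(w)^2 \, dw
  = \tfrac{1}{4} \int_0^1 \{\log f(w) - \log g(w)\}^2 \, dw .
\end{align*}
```

The neglected terms are of third order in $x(w)$, so the approximation is good when the ratio f/g stays close to 1 over the whole frequency range.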

The concepts have now been defined to state some
of the basic facts of parameter estimation theory.

Maximum likelihood estimators $\hat\theta$ are equivalent
to sample minimum cross-entropy estimators $\hat\theta$ defined
by

$$H(f_{\hat\theta}; \tilde f) = \min_\theta H(f_\theta; \tilde f).$$

They can be regarded as estimators of the population
minimum cross-entropy "parameters" $\theta^*$ defined by

$$H(f_{\theta^*}; f) = \min_\theta H(f_\theta; f)$$

where f is the true spectral density.

A maximum entropy spectral density $\hat f$ is defined
by

$$H(\hat f) = \max_f H(f)$$

where f is constrained to satisfy a set of constraints
of the form

$$\int_0^1 \psi_j(w)\, f(w)\, dw = C_j, \qquad j = 1, \ldots, M,$$

for M specified functions $\psi_j(w)$ and constants $C_j$.

When the constraints are of the form

$$\int_0^1 e^{2\pi i w j} f(w)\, dw = \rho(j), \qquad j = 0, \pm 1, \ldots, \pm m,$$

it can be shown that $\hat f(w)$ is the autoregressive spectral
density $f_m(w)$ defined as follows:


$$f_m(w) = \frac{\sigma_m^2}{|g_m(e^{2\pi i w})|^2}$$

where

$$g_m(z) = 1 + a_m(1) z + \ldots + a_m(m) z^m;$$

the autoregressive coefficients $a_m(1), \ldots, a_m(m)$
satisfy normal equations (called Yule-Walker
equations)

$$\sum_{k=0}^{m} a_m(k)\, \rho(k-j) = 0, \qquad j = 1, \ldots, m,$$

where $a_m(0) = 1$; and

$$\sigma_m^2 = \sum_{k=0}^{m} a_m(k)\, \rho(k).$$

It should be noted that from a sequence $\rho(v)$,
$v = 0, \pm 1, \ldots$ one can quickly compute $f_m(w)$ for all
successive values of m = 1, 2, ... using a variety of
fast algorithms [see Kailath (1974)]. In practice the
problem is to determine "optimal" values of m.
Some important properties of $f_m(w)$ are:

(1) $\int_0^1 e^{2\pi i w j} f_m(w)\, dw = \rho(j)$, $j = 0, \pm 1, \ldots, \pm m$;

(2) $\int_0^1 \frac{f(w)}{f_m(w)}\, dw = 1$;

(3) $\int_0^1 \log f_m(w)\, dw = \log \sigma_m^2$;

(4) $H(f_m; f) = H(f_m) = \frac{1}{2} (\log \sigma_m^2 + 1)$;

(5) $\sigma_m^2 = \int_0^1 |g_m(e^{2\pi i w})|^2 f(w)\, dw = \min_{c_1, \ldots, c_m} \int_0^1 |1 + c_1 e^{2\pi i w} + \ldots + c_m e^{2\pi i w m}|^2 f(w)\, dw$;

(6) $g_m(z)$ has all its roots in the complex plane outside
the unit circle;

(7) $\lim_{m \to \infty} \log \sigma_m^2 = \log \sigma_\infty^2$;

(8) $\log \sigma_\infty^2 = \int_0^1 \log f(w)\, dw$;

(9) $\lim_{m \to \infty} f_m(w) = f(w)$ if f is differentiable (the
rate of convergence of $f_m(w)$ depends on the rate
of convergence of $\sigma_m^2$ to $\sigma_\infty^2$);

(10) $2 I(f_m; f) = \log \sigma_m^2 - \log \sigma_\infty^2 \to 0$ as $m \to \infty$.

The foregoing facts explain why autoregressive
spectral approximations, introduced in Parzen (1968),
(1969), provide powerful, and natural, estimators of
an unknown spectral density. They are generated by
the "maximum entropy approach" introduced by Burg
(1967). However the Burg algorithm does not compute
the autoregressive coefficients $a_m(j)$ and the innovation
variances $\sigma_m^2$ by the Yule-Walker equations.
Indeed it does not compute either $\hat\rho(v)$ or $\tilde f(w)$. It
does not provide insight into how to identify
"optimal" autoregressive orders m.

One approach to defining criteria for an optimal
order m is to examine how well one has transformed to
white noise the residual series

$$\varepsilon_m(t) = Y(t) + a_m(1) Y(t-1) + \ldots + a_m(m) Y(t-m)$$

whose spectral density is given by

$$\frac{f(w)}{f_m(w)} = \sigma_m^{-2}\, |g_m(e^{2\pi i w})|^2 f(w).$$

A "model identification" determined order m has the
property that the spectral distribution function

$$F_m(w) = \int_0^w \frac{f(w')}{f_m(w')}\, dw', \qquad 0 \le w \le 1,$$

is parsimoniously not significantly different from the
uniform distribution $F_0(w) = w$ representing the spectral
distribution function of white noise.

In deriving autoregressive spectral estimators
or approximators, we have so far developed an analog
of Theorem 1A, by stating that autoregression provides
maximum entropy estimators subject to the constraint
that certain correlation values are attained. We
prefer an analog of Theorem 1B, which states that:
the minimum information divergence parameter estimators
of an autoregressive model for the true spectral
density are provided by the coefficients which satisfy
the Yule-Walker equations.

A very important fact (that may not be widely
known) is that the maximum entropy properties of
autoregressive spectral densities follow from their
minimum information divergence properties, using the fact
that

$$I(f_m; f) = H(f_m) - H(f) \ge 0;$$

consequently $H(f) \le H(f_m)$. Since f and $f_m$ satisfy
the constraint that their first m correlations equal
specified values, the entropy H(f) achieves its maximum
value at $f = f_m$.

3. Autoregressive spectral estimators and order
determining criteria.

Given a sample Y(t), t = 1, ..., T, there are
many approaches for forming autoregressive spectral
estimators, because [as summarized in Parzen (1981)]
there are four equivalent ways of parametrizing them:
(A) autoregressive coefficients, (B) correlations,
(C) partial correlations, and (D) innovation variances.
Here we only consider starting with the sample
correlations $\hat\rho(v)$. Then for m = 1, 2, ... one forms

$$\hat f_m(w) = \frac{\hat\sigma_m^2}{|\hat g_m(e^{2\pi i w})|^2}$$

where

$$\hat g_m(z) = 1 + \hat a_m(1) z + \ldots + \hat a_m(m) z^m;$$

the sample autoregressive coefficients $\hat a_m(j)$ satisfy
the sample Yule-Walker equations

$$\sum_{k=0}^{m} \hat a_m(k)\, \hat\rho(k-j) = 0, \qquad j = 1, \ldots, m,$$

where $\hat a_m(0) = 1$; and the order m innovation variance

$$\hat\sigma_m^2 = \sum_{k=0}^{m} \hat a_m(k)\, \hat\rho(k).$$

Define $\tilde\sigma^2$ by

$$\log \tilde\sigma^2 = \int_0^1 \log \tilde f(w)\, dw.$$

Then as m tends to T,

$$2 I(\hat f_m; \tilde f) = \log \hat\sigma_m^2 - \log \tilde\sigma^2 \to 0.$$
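The Yule-Walker equations can be solved for all successive orders by an order-recursive scheme. The sketch below uses the Levinson-Durbin recursion, which is named here as one standard choice of fast algorithm and is not taken from the paper itself. It is checked on the assumed correlation sequence rho(v) = 0.6**v, for which the order-1 solution is known in closed form. Function and variable names are hypothetical.

```python
import cmath
import math

def levinson_durbin(rho, max_order):
    """Solve the Yule-Walker equations for orders 0..max_order.

    rho[v] is the correlation at lag v, with rho[0] == 1.
    Returns (coeffs, sigma2) where coeffs[m] = [1, a_m(1), ..., a_m(m)]
    and sigma2[m] = sum_k a_m(k) * rho[k] is the order-m innovation variance.
    """
    coeffs = [[1.0]]
    sigma2 = [rho[0]]
    for m in range(1, max_order + 1):
        prev = coeffs[-1]
        # Prediction error correlation at lag m under the order m-1 fit.
        acc = sum(prev[k] * rho[m - k] for k in range(m))
        refl = -acc / sigma2[-1]                      # this is a_m(m)
        new = [1.0] + [prev[k] + refl * prev[m - k] for k in range(1, m)] + [refl]
        coeffs.append(new)
        sigma2.append(sigma2[-1] * (1.0 - refl * refl))
    return coeffs, sigma2

def ar_spectral_density(a, s2, w):
    """f_m(w) = s2 / |g_m(exp(2*pi*i*w))|^2 with g_m(z) = sum_k a[k] * z**k."""
    g = sum(a[k] * cmath.exp(2j * math.pi * w * k) for k in range(len(a)))
    return s2 / abs(g) ** 2

rho = [0.6 ** v for v in range(6)]        # assumed example correlations
coeffs, sigma2 = levinson_durbin(rho, 3)
f1_at_zero = ar_spectral_density(coeffs[1], sigma2[1], 0.0)
```

Each order costs on the order of m operations, so all orders up to m cost on the order of m squared, compared with solving each linear system from scratch.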


We desire $\hat f_m$ to be a sequence of consistent
estimators of f in the sense that if one chooses m as a
suitable function of T, then as $T \to \infty$

$$I(\hat f_m; f) \to 0 \quad \text{and} \quad \hat f_m(w) \to f(w),$$

in probability, or with probability one, or in mean
square. The first rigorous proof of such results was
given by Berk (1974) who also finds the asymptotic
variance of $\hat f_m(w)$, confirming conjectures in Parzen (1969).

We now consider the problem of choosing m adaptively
from a sample of size T. Conceptually one
would like to choose m to minimize $I(\hat f_m; f)$. One
approach to such a procedure is given by Akaike (1974)
and leads to an order determining criterion called AIC.

The Akaike information criterion computes for
m = 1, 2, ...

$$\mathrm{AIC}(m) = \log \hat\sigma_m^2 + \frac{2m}{T}$$

and an optimal order $\hat m$ satisfying

$$\mathrm{AIC}(\hat m) = \min_m \mathrm{AIC}(m).$$

The optimal order $\hat m$ can equal 0, indicating that
the time series is white noise. The value of AIC(0)
can be adjusted to the value one desires for the
probability of rejecting the hypothesis of white noise,
when in fact the time series is white noise. We
recommend

$$\mathrm{AIC}(0) = -\frac{1}{T}.$$
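The order selection step can be sketched as follows, assuming the usual form of the criterion, AIC(m) = log of the order-m innovation variance plus 2m/T, and leaving the value assigned at order 0 as a parameter `aic0` so that it can be adjusted as the text describes. The innovation variances in the two examples are made-up numbers, and the function name is hypothetical.

```python
import math

def aic_order(sigma2, T, aic0=0.0):
    """Return (best_order, aic_values).

    sigma2[m] is the estimated order-m innovation variance (sigma2[0] is unused),
    T is the sample size, and aic0 is the value assigned to AIC(0).
    """
    aic = [aic0] + [math.log(sigma2[m]) + 2.0 * m / T for m in range(1, len(sigma2))]
    best = min(range(len(aic)), key=aic.__getitem__)
    return best, aic

T = 100

# Made-up variances where order 1 captures nearly all of the structure.
best_ar, aic_ar = aic_order([1.0, 0.64, 0.639, 0.6385], T, aic0=-1.0 / T)

# Made-up variances that barely decrease, as expected for white noise.
best_white, aic_white = aic_order([1.0, 0.995, 0.99], T, aic0=-1.0 / T)
```

In the second example every AIC(m) with m >= 1 is positive because the penalty 2m/T outweighs the small drop in log variance, so order 0 is selected.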

Parzen (1974), (1977) proposes autoregressive
order determining criteria, called CAT, whose foundations
are different from those of AIC but which usually
lead to exactly equivalent orders in practice. The
time series model identification problem is to estimate
the infinite autoregressive transfer function

$$g_\infty(z) = 1 + a_\infty(1) z + \ldots + a_\infty(n) z^n + \ldots$$

by a sample order m autoregressive transfer function
$\hat g_m(z)$. To evaluate the overall mean square error it
is convenient to define

$$J_m = E \int_0^1 \big| \hat\sigma_m^{-2}\, \hat g_m(e^{2\pi i w}) - \sigma_\infty^{-2}\, g_\infty(e^{2\pi i w}) \big|^2 f(w)\, dw$$

which can be shown to be the sum of a variance term

$$E \int_0^1 \big| \hat\sigma_m^{-2}\, \hat g_m(e^{2\pi i w}) - \sigma_m^{-2}\, g_m(e^{2\pi i w}) \big|^2 f(w)\, dw$$

and a bias term

$$\int_0^1 \big| \sigma_m^{-2}\, g_m(e^{2\pi i w}) - \sigma_\infty^{-2}\, g_\infty(e^{2\pi i w}) \big|^2 f(w)\, dw.$$

One can show that the variance term is approximately

$$\frac{1}{T} \sum_{j=1}^{m} \sigma_j^{-2}$$

and the bias term equals $\sigma_\infty^{-2} - \sigma_m^{-2}$. The criterion
CAT(m) is an estimator of the terms in $J_m$ which depend on m;
thus one minimizes

$$\mathrm{CAT}(m) = \frac{1}{T} \sum_{j=1}^{m} \tilde\sigma_j^{-2} - \tilde\sigma_m^{-2}$$

where

$$\tilde\sigma_j^2 = \frac{T}{T-j}\, \hat\sigma_j^2$$

is an "unbiased" estimator of $\sigma_j^2$. At m = 0, we
assign CAT(0) = -(1 + (1/T)).

It should be noted that a multiple time series
version of CAT is given in Parzen (1977).

An order determining criterion which is consistent,
but whose behavior in practice is controversial,
is given by Hannan and Quinn (1979).


4. Log Spectral Kernel Estimator and Cepstral
Correlations


The approach we have been describing for forming
"optimal" estimators $\hat f(w)$ of the spectral density f(w)
of a stationary time series is to view $\hat f(w)$ as a
function closest to $\tilde f(w)$ in a distance between spectral
densities given by the information divergence
$I(\hat f; \tilde f)$. The class of functions from which $\hat f(w)$ is
chosen has been constrained (or specified) parametrically,
in the sense that $\hat f(w)$ is of the form $f_{\hat\theta}(w)$,
where $\hat\theta$ estimates the parameters $\theta$ of a model $f_\theta(w)$
for the true f(w).

A non-parametric constraint is to impose a
smoothness measure on $\hat f$ such as the integral square
of the r-th derivative of $\log \hat f(w)$, denoted

$$\int_0^1 \big| (\log \hat f(w))^{(r)} \big|^2\, dw.$$

One then seeks to choose $\hat f$ to maximize smoothness,
while minimizing a measure of distance of $\hat f$ from $\tilde f$.
Wahba (1980) introduces the estimation distance

$$\int_0^1 \big| \log \hat f(w) - \log \tilde f(w) \big|^2\, dw + \kappa \int_0^1 \big| (\log \hat f(w))^{(r)} \big|^2\, dw$$

where $\kappa$ is a penalty parameter to be determined adaptively
by the data. One may show that the resulting
estimators of $g(w) = \log f(w)$ are of the form, called
log spectral kernel estimators,

$$\hat g(w) = (\log f(w))^{\wedge} = \sum_{v} \exp(-2\pi i w v)\, k\Big(\frac{v}{M}\Big)\, \tilde\gamma(v)$$

where

$$\tilde\gamma(v) = \int_0^1 \exp(2\pi i w v)\, \log \tilde f(w)\, dw$$

are called cepstral correlations, and the kernel k(x)
is given by (compare Parzen (1958))

$$k(x) = \frac{1}{1 + x^{2r}}.$$


One often considers only two values for r, 2 and 4.

The statistical properties of cepstral correlations
have been extensively investigated by Bhansali
(1974).

Since $k(v/M) = \frac{1}{2}$ for v = M, we call M the "half
power" lag. We seek to adaptively determine M from
the sample to minimize the risk function

$$R_M = J(\hat f; f) = E\, L2L(\hat f, f)$$

assuming log f(w) has a representation

$$g(w) = \log f(w) = \sum_{v=-\infty}^{\infty} \exp(-2\pi i w v)\, \gamma(v).$$

Following Wahba (1980), to minimize $R_M$ one minimizes
an estimator of it of the form

$$\hat R_M = B(M) + V(M,T)$$

where B(M) and V(M,T) are measures of bias and variance
given by

BOO  -  i.  i  <Y(v>)Vr'l  +  (£)2rf2  . 

M  M<t/:  M 

V(  N.T)  ■;'/  l  f  (1  +  u2Vl  du  . 

'  6  0 

A closed form evaluation of the integral in V(M,T)
can be obtained.
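A log spectral kernel estimator can be sketched directly from its definition: compute cepstral correlations of the log sample spectral density on a frequency grid, taper them, and transform back. The code below is a sketch under the assumption that the kernel is k(x) = 1 / (1 + x**(2*r)), which equals 1/2 at x = 1 and so matches the "half power" description of M. The input is a synthetic log spectrum with two cosine components, not real data, and the direct transforms are written for clarity rather than speed.

```python
import cmath
import math

def log_spectral_kernel_estimate(log_f_tilde, M, r=2):
    """Smooth values of log f~ sampled at w = k/Q, k = 0..Q-1.

    Cepstral correlations are approximated by a discrete Fourier sum and then
    weighted by the assumed kernel k(v/M) = 1 / (1 + (v/M)**(2*r)).
    """
    Q = len(log_f_tilde)
    gamma = [
        sum(log_f_tilde[k] * cmath.exp(2j * math.pi * k * v / Q) for k in range(Q)) / Q
        for v in range(Q)
    ]
    smoothed = []
    for k in range(Q):
        total = 0j
        for v in range(Q):
            lag = v if v <= Q // 2 else v - Q          # signed lag for index v
            weight = 1.0 / (1.0 + (abs(lag) / M) ** (2 * r))
            total += weight * gamma[v] * cmath.exp(-2j * math.pi * k * v / Q)
        smoothed.append(total.real)
    return smoothed

Q, M, r = 32, 4, 2
# Synthetic log spectrum: a slow component (lag 1) plus a fast one (lag 8).
log_f = [math.cos(2 * math.pi * k / Q) + 0.5 * math.cos(2 * math.pi * 8 * k / Q)
         for k in range(Q)]
smooth = log_f_tilde_smoothed = log_spectral_kernel_estimate(log_f, M, r)

w1 = 1.0 / (1.0 + (1 / M) ** (2 * r))   # weight applied at lag 1
w8 = 1.0 / (1.0 + (8 / M) ** (2 * r))   # weight applied at lag 8
expected = [w1 * math.cos(2 * math.pi * k / Q) + 0.5 * w8 * math.cos(2 * math.pi * 8 * k / Q)
            for k in range(Q)]
```

With M = 4 and r = 2 the lag-1 component passes almost unchanged while the lag-8 component is multiplied by 1/17, which is the smoothing effect the half power lag controls.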

5. Iterated spectral estimation.

Observed time series do not usually obey the
assumptions made in the foregoing theory that Y(t) is
a zero mean Gaussian time series with summable correlation
function. We call such a time series a "short
memory" time series (of which white noise is a special
case, called a "no memory" time series). Otherwise
the time series is called "long memory" (Parzen (1982)).

Autoregressive spectral estimators are especially
suitable for matching the large scale oscillations
of the spectral density of a long memory time series.
The role of the autoregressive filter is then to
transform the time series to a short memory time
series [obtained as the residuals $\varepsilon_m(t)$ described in
section 2]. The spectral density of the short memory
series, which can be regarded as the fine structure
of the original spectral density, can be estimated by
a log spectral smoothing estimator as well as by an
autoregressive spectral estimator. Employing two
different approaches to short memory spectral estimation
is desirable since the problem of spectral
estimation is not simply a problem of parameter
estimation but is also one of model identification.

Iterated models for forecasting long memory time
series are used by Parzen (1981) under the name of
"ARARMA models" (see Appendix for an example).
APPENDIX:


Wolfer Sunspot Numbers 1846-1963.


To illustrate the application of some of the
foregoing ideas, we report an iterated autoregressive
model fitted to the annual time series Y(t) of Wolfer's
sunspot data for the years 1846-1963 (which is a sample
of length T = 118). Our ARARMA model fitting algorithm
automatically proposes the following model (which
it hopes will have the best medium range, if not long
range, forecasting capability):


$$\tilde Y(t) = Y(t) - .482\, Y(t-10) - .554\, Y(t-11)$$
$$\tilde Y(t) - 1.009\, \tilde Y(t-1) + .362\, \tilde Y(t-2) = \varepsilon(t)$$


The series $\tilde Y(t)$ is a short memory time series to which
Y(t) has been transformed by the initial autoregression
on Y(t). As an estimator of the true log spectral
density f(w) we take, up to a normalizing constant,


is the autoregressive spectral density corresponding
to the transformation from Y(t) to $\tilde Y(t)$.

Figure 1 is a graph of the Wolfer sunspot data
(the crosses represent the one-step ahead predictors
of the model above). Figure 2 graphs the iterated
autoregressive log spectral estimator $\log \hat f(w)$.